The availability of satellite-based atmospheric column observations are limited by the surface and meteorological conditions of the sensing date, and therefore missing values (pixels) exist in the dataset. Before using these satellite-derived observations as input variables to model the ground-level NO2 concentration, the missing pixels should be first imputed. In principle, the method is to model the relationship between the satellite atmospheric column observation and some covariates (which do not come with missing values) using the pixel-days without missing values, and to implement this model to predict the missing pixels in the satellite-based atmospheric column observation datasets. This document demonstrates the development and implementation of the imputation model for the OMI-NO2 product.
library(stars) ; library(sf) ; library(raster)
library(dplyr) ; library(tidyr)
library(ggplot2) ; library(ggsci)
library(lubridate) ; library(stringr)
library(randomForest) ; library(ranger)
Based on literature reviews, the following variables are selected to model the missing OMI NO2 observations. * CAMS Global Reanalysis data (modeled-NO2 and NO) at various time (00, 03, 06, 09, 12, 15, 18, 21)
* ERA5 meteorological variables at various time (00, 03, 06, 09, 12, 15, 18, 21)
* temperature at 2m, surface pressure, total precipitation, boundary layer height, total cloud cover, wind speed and wind direction (calculated from u and v wind components)
* altitude (EU-DEM v1.1)
* coordinates of the cells: x, y
* Day Of Year
# file paths
files_OMI <- list.files("1_data/raw/OMI-NO2" , "_AOI.tif$" , full.names = TRUE)
# timestamps
ts_OMI_raw <- basename(files_OMI) %>%
str_extract("\\d{8}") %>%
as_date()
# read as a 3D array: x, y, date
OMI_raw <- read_stars(files_OMI , along = list(date = ts_OMI_raw)) %>%
setNames("value")
## stars object with 3 dimensions and 1 attribute
## attribute(s):
## value
## Min. :-9.982e+14
## 1st Qu.: 4.562e+14
## Median : 1.389e+15
## Mean : 1.968e+15
## 3rd Qu.: 2.728e+15
## Max. : 5.517e+16
## NA's :83574
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 24 5.25 0.25 GEOGCS["unnamed ellipse",... FALSE NULL [x]
## y 1 13 45.25 0.25 GEOGCS["unnamed ellipse",... FALSE NULL [y]
## date 1 365 2019-01-01 1 days Date NA NULL
The dataset is a 3D spatialtemporal array datacube (stars).
# 3D array: x, y, time
CAMS_raw <- read_stars("1_data/raw/ECMWF-CAMS-NO2/CAMS-NO2.nc") %>%
st_set_crs(st_crs(4326)) # long-lat coordinates
# 2D array: x, y
DEM_raw <- read_stars("1_data/raw/EU-DEM-1.1/eu_dem_v11_E40N20_AOI.tif")
# 3D array: x, y, time
ERA_raw <- read_stars("1_data/raw/ECMWF-ERA5/ERA5-meteorology.nc") %>%
st_set_crs(st_crs(4326)) # long-lat coordinates
CH <- st_read("1_data/raw/Switzerland_shapefile/CHE_adm0.shp") %>%
st_geometry() %>%
st_simplify(preserveTopology = TRUE , dTolerance = 0.05)
AOI <- st_read("1_data/raw/Switzerland_shapefile/AOI_4326.shp")
To do the calculation and modeling based on spatially and/or temporally co-locating pixels of the various predictor variables, all of the datasets need to have the same spatial dimensions (x,y aka spatial resolution and coordinate systems) and/or temporal dimension (daily).
For the imputation model of OMI-NO2, all the predictor variables are resampled and reshaped to the dimension of OMI, which is:
Here are the resampling methods applied at this step:
Downscaling (0.75˚ to 0.25˚) with nearest neighbor resampling
CAMS_rsNN <- CAMS_raw %>%
st_warp(dest = OMI_raw)
Separate the resampled CAMS data into sub-datasets(8*2=16) to have the same dimension as OMI (3-hour to daily)
subset.CAMS_hour <- CAMS_rsNN %>%
st_get_dimension_values(3) %>%
hour() %>%
unique()
# 2 attributes: "tcno2" "tc_no"
subset.CAMS_var <- CAMS_rsNN %>% names()
# subset into 16 3D arrays
for(h in subset.CAMS_hour){
for(v in subset.CAMS_var){
# subset the raw dataset: at hour h and for variable (attribute) v
CAMS_rs_temp <- CAMS_rsNN %>%
filter(hour(time) == h) %>%
select(all_of(v)) %>%
setNames(v) %>%
# dimension: date
st_set_dimensions(3 ,
values = as_date(st_get_dimension_values(. , 3)) ,
names = "date")
# name the object name as CAMS_rs_{Variable}_{Hour}
assign(paste0("CAMS_rs_" , v , "_" , h) , CAMS_rs_temp)
rm(CAMS_rs_temp)
}
}
# clean intermediate objects of the loop
rm(h , v , subset.CAMS_hour , subset.CAMS_var)
Output: CAMS_rs_{Variable}_{Hour} with
{Variable}: tc_no, tcno2{Hour}: 0, 3, 6, 9, 12, 15, 18, 21 ;and each of these as a 3D array. Here’s an example comaring the original and resampled CAMS data on a random date:
Bilinear interpolation
# 25*25m -> 7*13km upscaling for continuous data
# bilinear interpolation
DEM_rs <- DEM_raw %>%
st_warp(OMI_raw[,,,1] %>% split("date") ,
use_gdal = TRUE , method = "bilinear") %>%
setNames("DEM")
Nearest-neighbor resampling
ERA_rsNN <- ERA_raw %>%
st_warp(dest = OMI_raw)
Separate the resampled ERAA5 data into sub-datasets(8*2=16) to have the same dimension as OMI (3-hour to daily)
subset.ERA_hour <- ERA_rsNN %>%
st_get_dimension_values(3) %>%
hour() %>%
unique()
# 7 attributes: "u10" "v10" "t2m" "blh" "sp" "tcc" "tp"
subset.ERA_var <- ERA_rsNN %>% names()
# subset into 16 3D arrays
for(h in subset.ERA_hour){
for(v in subset.ERA_var){
# subset the raw dataset: at hour h and for variable (attribute) v
ERA_rs_temp <- ERA_rsNN %>%
filter(hour(time) == h) %>%
select(all_of(v)) %>%
setNames(v) %>%
# # dimension: date
st_set_dimensions(3 ,
values = as_date(st_get_dimension_values(. , 3)) ,
names = "date")
# name the object name as ERA_rs_{Variable}_{Hour}
assign(paste0("ERA_rs_" , v , "_" , h) , ERA_rs_temp)
rm(ERA_rs_temp)
}
}
# clean intermediate objects of the loop
rm(h , v , subset.ERA_hour , subset.ERA_var)
Output: ERA_rs_{Variable}_{Hour} with
{Variable}: u10, v10, t2m, blh, sp, tcc, tp{Hour}: 0, 3, 6, 9, 12, 15, 18, 21 ;and each of these as a 3D array.
Conversion formula:
subset.ERA_hour <- ERA_raw %>%
st_get_dimension_values(3) %>%
hour() %>%
unique()
for(h in subset.ERA_hour){
ERA_rs_wind_temp <- c(
get(paste0("ERA_rs_u10_" , h)) ,
get(paste0("ERA_rs_v10_" , h))
) %>%
# wd = atan2(-u10,-v10)
# ws = sqrt(u10^2 + v10^2)
transmute(wd = atan2(-u10 , -v10) ,
ws = sqrt(u10^2+v10^2))
# name the object name as CAMS_rs_wind_{Hour}
assign(paste0("ERA_rs_wind_" , h) , ERA_rs_wind_temp)
}
# clean
rm(h , subset.ERA_hour , ERA_rs_wind_temp)
# spatialtemporal datasets: 3D (x,y,date) + attributes --------------------------------
spatialtemporal <- c(
# OMI
OMI_raw %>%
setNames("OMI_NO2") ,
# CAMS NO2
c(CAMS_rs_tcno2_0 , CAMS_rs_tcno2_3 , CAMS_rs_tcno2_6 , CAMS_rs_tcno2_9 ,
CAMS_rs_tcno2_12 , CAMS_rs_tcno2_15 , CAMS_rs_tcno2_18 , CAMS_rs_tcno2_21) %>%
setNames(c("CAMS_NO2_00" , "CAMS_NO2_03" , "CAMS_NO2_06" , "CAMS_NO2_09" ,
"CAMS_NO2_12" , "CAMS_NO2_15" , "CAMS_NO2_18" , "CAMS_NO2_21")) ,
# CAMS NO
c(CAMS_rs_tc_no_0 , CAMS_rs_tc_no_3 , CAMS_rs_tc_no_6 , CAMS_rs_tc_no_9 ,
CAMS_rs_tc_no_12 , CAMS_rs_tc_no_15 , CAMS_rs_tc_no_18 , CAMS_rs_tc_no_21) %>%
setNames(c("CAMS_NO_00" , "CAMS_NO_03" , "CAMS_NO_06" , "CAMS_NO_09" ,
"CAMS_NO_12" , "CAMS_NO_15" , "CAMS_NO_18" , "CAMS_NO_21")) ,
# ERA blh
c(ERA_rs_blh_0 , ERA_rs_blh_3 , ERA_rs_blh_6 , ERA_rs_blh_9 ,
ERA_rs_blh_12 , ERA_rs_blh_15 , ERA_rs_blh_18 , ERA_rs_blh_21) %>%
setNames(c("blh_00" , "blh_03" , "blh_06" , "blh_09" ,
"blh_12" , "blh_15" , "blh_18" , "blh_21")) ,
# ERA temperature
c(ERA_rs_t2m_0 , ERA_rs_t2m_3 , ERA_rs_t2m_6 , ERA_rs_t2m_9 ,
ERA_rs_t2m_12 , ERA_rs_t2m_15 , ERA_rs_t2m_18 , ERA_rs_t2m_21) %>%
setNames(c("t2m_00" , "t2m_03" , "t2m_06" , "t2m_09" ,
"t2m_12" , "t2m_15" , "t2m_18" , "t2m_21")) ,
# ERA surface pressure
c(ERA_rs_sp_0 , ERA_rs_sp_3 , ERA_rs_sp_6 , ERA_rs_sp_9 ,
ERA_rs_sp_12 , ERA_rs_sp_15 , ERA_rs_sp_18 , ERA_rs_sp_21) %>%
setNames(c("sp_00" , "sp_03" , "sp_06" , "sp_09" ,
"sp_12" , "sp_15" , "sp_18" , "sp_21")) ,
# ERA total cloud cover
c(ERA_rs_tcc_0 , ERA_rs_tcc_3 , ERA_rs_tcc_6 , ERA_rs_tcc_9 ,
ERA_rs_tcc_12 , ERA_rs_tcc_15 , ERA_rs_tcc_18 , ERA_rs_tcc_21) %>%
setNames(c("tcc_00" , "tcc_03" , "tcc_06" , "tcc_09" ,
"tcc_12" , "tcc_15" , "tcc_18" , "tcc_21")) ,
# ERA total precipitation
c(ERA_rs_tp_0 , ERA_rs_tp_3 , ERA_rs_tp_6 , ERA_rs_tp_9 ,
ERA_rs_tp_12 , ERA_rs_tp_15 , ERA_rs_tp_18 , ERA_rs_tp_21) %>%
setNames(c("tp_00" , "tp_03" , "tp_06" , "tp_09" ,
"tp_12" , "tp_15" , "tp_18" , "tp_21")) ,
# ERA wind
c(ERA_rs_wind_0 , ERA_rs_wind_3 , ERA_rs_wind_6 , ERA_rs_wind_9 ,
ERA_rs_wind_12 , ERA_rs_wind_15 , ERA_rs_wind_18 , ERA_rs_wind_21) %>%
setNames(c("wd_00", "ws_00", "wd_03", "ws_03", "wd_06", "ws_06", "wd_09", "ws_09",
"wd_12", "ws_12", "wd_15", "ws_15", "wd_18", "ws_18", "wd_21", "ws_21"))
) %>%
st_set_dimensions(3 , values = yday(st_get_dimension_values(. , 3)) , names = "DOY")
# spatial datasets: 2D (x,y) + attributes --------------------------------
spatial <- c(
# DEM
DEM_rs
)
After reshaping the data, spatialtemporal is a 3D array (x,y,day) and spatial is a 2D array (x,y). Both matches the resolution of OMI.
Covert the spatialtemporal and spatial arrays to data.frame for the model:
imputation_df <- spatialtemporal %>%
as.data.frame() %>%
mutate(across(everything() , as.numeric)) %>%
left_join(spatial %>% as.data.frame() ,
by = c("x" , "y")) %>%
# Some pixels are outside of the AOI and are always NA,
# should be removed from the data.frame so that it wouldn’t be counted.
full_join(
# AOI mask in raster form
AOI %>%
st_rasterize(template = spatial) %>%
# is.AOI: a column with TRUE/FALSE
mutate(is.AOI = ifelse(is.na(FID) , FALSE , TRUE)) %>%
as.data.frame(),
by = c("x" , "y")
) %>%
# remove the pixels that are outside AOI (is.AOI=FALSE)
filter(is.AOI) %>%
# remove the FID and is.AOI columns from the AOI mask
select(-FID , -is.AOI)
The area of interest is divided into large grids and randomly assigned with IDs (1-10) to apply a spatially-blocked cross validation.
# spatially blocked k-fold cross validation
k_fold <- 10
# spatially-blocked by grids across AOI
set.seed(2201)
# make grids and assign cross validation index
spatialGrid_CV <- AOI %>%
# reporject to OMI
st_transform(st_crs(spatial)) %>%
# make grids
st_make_grid(n = c(10,4)) %>%
st_as_sf() %>%
# randomly assign cross validation index
mutate(spatial_CV = sample(1:k_fold , nrow(.) , replace = TRUE))
# rasterize the grids with CV-index to stars
spatialGrid_CV_stars <- st_rasterize(spatialGrid_CV["spatial_CV"], template = spatial) %>%
mutate(spatial_CV = as.factor(spatial_CV))
ggplot() +
# spatial-CV
geom_stars(data = spatialGrid_CV_stars) +
# OMI grids
geom_sf(data = spatial %>%
# convert the grids to polygon to visualize the grid cells
st_as_sf(as_points = FALSE) ,
color = "azure1" , fill = NA , alpha = 0.5 , size = 0.2) +
# Switzerland
geom_sf(data = CH , fill = NA , color = "white") +
# AOI
geom_sf(data = AOI , fill = NA , color = "azure2") +
scale_fill_npg() +
coord_sf(expand = FALSE) +
labs(x = "Longtitude" , y = "Latitude" , fill = "k-fold \ncross \nvalidation \ngroup")
## Warning: Removed 24 rows containing missing values (geom_raster).
imputation_df <- spatialGrid_CV_stars %>%
setNames("spatial_CV") %>%
as.data.frame() %>%
# join to the input data
right_join(imputation_df , by = c("x" , "y"))
Take a look at the full data imputation_df:
str(imputation_df)
## 'data.frame': 73000 obs. of 78 variables:
## $ x : num 9.12 9.12 9.12 9.12 9.12 ...
## $ y : num 45.4 45.4 45.4 45.4 45.4 ...
## $ spatial_CV : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ DOY : num 1 2 3 4 5 6 7 8 9 10 ...
## $ OMI_NO2 : num NA 6.11e+15 NA 1.77e+16 NA ...
## $ CAMS_NO2_00: num 2.89e-05 9.26e-06 3.48e-06 5.33e-06 1.81e-05 ...
## $ CAMS_NO2_03: num 2.49e-05 9.35e-06 3.46e-06 5.13e-06 1.82e-05 ...
## $ CAMS_NO2_06: num 2.41e-05 7.98e-06 3.71e-06 5.45e-06 1.89e-05 ...
## $ CAMS_NO2_09: num 1.80e-05 4.75e-06 3.27e-06 6.12e-06 1.76e-05 ...
## $ CAMS_NO2_12: num 1.58e-05 3.48e-06 3.16e-06 9.17e-06 1.03e-05 ...
## $ CAMS_NO2_15: num 2.01e-05 3.62e-06 5.32e-06 1.42e-05 9.38e-06 ...
## $ CAMS_NO2_18: num 2.41e-05 4.72e-06 6.32e-06 1.94e-05 8.57e-06 ...
## $ CAMS_NO2_21: num 2.30e-05 4.37e-06 5.66e-06 2.33e-05 6.81e-06 ...
## $ CAMS_NO_00 : num 2.79e-06 9.56e-07 6.69e-09 9.69e-07 1.59e-06 ...
## $ CAMS_NO_03 : num 3.43e-06 1.27e-06 2.05e-08 1.55e-06 2.02e-06 ...
## $ CAMS_NO_06 : num 3.27e-06 9.71e-07 8.43e-08 2.02e-06 2.31e-06 ...
## $ CAMS_NO_09 : num 7.98e-06 1.98e-06 1.02e-06 3.73e-06 4.73e-06 ...
## $ CAMS_NO_12 : num 7.33e-06 1.24e-06 1.33e-06 4.88e-06 4.83e-06 ...
## $ CAMS_NO_15 : num 3.43e-06 6.54e-07 8.91e-07 2.45e-06 2.60e-06 ...
## $ CAMS_NO_18 : num 3.51e-07 5.78e-09 2.92e-07 7.32e-07 4.02e-07 ...
## $ CAMS_NO_21 : num 7.34e-07 6.69e-09 6.70e-07 1.27e-06 4.76e-07 ...
## $ blh_00 : num 48.8 155.8 452.9 44.8 36.3 ...
## $ blh_03 : num 74.2 124.7 170.7 25.1 26 ...
## $ blh_06 : num 40.5 125.1 93.1 20.2 27.3 ...
## $ blh_09 : num 35.6 1009.3 81.7 45.5 69.2 ...
## $ blh_12 : num 342 1728 420 292 267 ...
## $ blh_15 : num 307.9 1424.4 171.9 47.9 198.9 ...
## $ blh_18 : num 112.9 1015.8 38.1 17.7 94.1 ...
## $ blh_21 : num 77.9 1110.2 45 18.1 56.3 ...
## $ t2m_00 : num 276 275 277 274 271 ...
## $ t2m_03 : num 274 276 276 271 271 ...
## $ t2m_06 : num 275 275 273 269 272 ...
## $ t2m_09 : num 276 282 276 272 274 ...
## $ t2m_12 : num 279 284 281 279 279 ...
## $ t2m_15 : num 280 283 280 279 282 ...
## $ t2m_18 : num 276 281 275 275 277 ...
## $ t2m_21 : num 275 279 273 273 274 ...
## $ sp_00 : num 102170 101056 101800 102031 101822 ...
## $ sp_03 : num 102257 100961 101845 102114 101764 ...
## $ sp_06 : num 102280 101022 101850 102089 101642 ...
## $ sp_09 : num 102324 101401 101975 102250 101580 ...
## $ sp_12 : num 102100 101372 101850 102197 101246 ...
## $ sp_15 : num 101777 101572 101821 102132 101055 ...
## $ sp_18 : num 101581 101716 101912 102056 101112 ...
## $ sp_21 : num 101304 101782 102005 102002 101242 ...
## $ tcc_00 : num 3.09e-01 2.85e-01 5.55e-17 5.55e-17 3.75e-03 ...
## $ tcc_03 : num 2.55e-01 7.43e-01 5.55e-17 2.39e-01 3.90e-01 ...
## $ tcc_06 : num 4.69e-01 5.03e-01 5.55e-17 1.20e-01 9.75e-01 ...
## $ tcc_09 : num 0.7202 0.0468 0.1694 0.1446 0.7507 ...
## $ tcc_12 : num 6.49e-01 5.55e-17 3.27e-02 3.67e-01 1.40e-03 ...
## $ tcc_15 : num 4.74e-01 5.55e-17 1.29e-02 6.26e-01 5.55e-17 ...
## $ tcc_18 : num 3.41e-01 5.55e-17 5.55e-17 6.96e-01 5.55e-17 ...
## $ tcc_21 : num 3.52e-01 5.55e-17 6.10e-05 2.36e-01 5.55e-17 ...
## $ tp_00 : num -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
## $ tp_03 : num -8.67e-19 4.17e-07 -8.67e-19 -8.67e-19 -8.67e-19 ...
## $ tp_06 : num -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
## $ tp_09 : num -8.67e-19 7.91e-06 -8.67e-19 -8.67e-19 -8.67e-19 ...
## $ tp_12 : num 8.33e-07 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
## $ tp_15 : num -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
## $ tp_18 : num -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
## $ tp_21 : num 4.17e-07 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
## $ wd_00 : num 1.328 -1.811 -0.335 2.696 -2.096 ...
## $ ws_00 : num 2.1 4.11 4.12 2.75 1.17 ...
## $ wd_03 : num 1.273 -1.699 -0.489 -0.25 -2.537 ...
## $ ws_03 : num 2.46 3.8 3.29 1.45 1.68 ...
## $ wd_06 : num 1.028 -0.917 -0.661 -2.219 -2.383 ...
## $ ws_06 : num 2.241 1.944 2.643 0.992 1.157 ...
## $ wd_09 : num -0.949 0.38 -1.242 -2.419 -2.109 ...
## $ ws_09 : num 0.264 4.674 1.251 0.546 1.918 ...
## $ wd_12 : num -1.977 0.138 -2.16 -1.924 -1.958 ...
## $ ws_12 : num 1.17 7.22 2.04 1.54 2.49 ...
## $ wd_15 : num -1.362 0.289 -2.803 -1.933 -1.883 ...
## $ ws_15 : num 1.4 5.14 1.47 1.11 3.84 ...
## $ wd_18 : num -1.526 0.199 -1.857 -2.821 -1.819 ...
## $ ws_18 : num 2.31 4.95 2.11 1.03 3.75 ...
## $ wd_21 : num -1.5436 -0.0215 -0.6 -2.3604 -1.4804 ...
## $ ws_21 : num 2.37 4.65 2.13 1.22 2.13 ...
## $ DEM : num 128 128 128 128 128 ...
Random forest is used to model the relationship between \(OMI-NO_2\) and the covariates. The first step is to optimize the predictor variable set that is included in the model.
For the CAMS-Global-Reanalysis data and the ERA5 data, we included the data for the whole day with 3-hour interval at the first place (0H, 3H, 6H, 9H, 12H, 15H, 18H, 21H) and let the model decide which one to use, based on model performance indices like \(R^2\). We used a grid search to try the models with data of different timesteps and compare the OOB- and CV-\(R^2\). Additionally, we looked at different combination (subsets) of the predictor variables. (Preliminary tryings, which include purely-random values in the predictor variables, suggest that total precipitation tp is as useless as random values to the response variable by looking at the variable importance output of the random forest model.) The three predictor variable combination sets are
base: CAMS-NO2, CAMS-NO, DOY, altitude, x, ybase_meteo6: base + temp, blh, tcc, sp, ws, wdbase_meteo7: base + temp, blh, tcc, sp, ws, wd, tp# all combinations
search_hour <- c("00" , "03" , "06" , "09" , "12" , "15" , "18" , "21")
search_var <- list(
# base model
base = c("x" , "y" , "DOY" , "DEM" , "CAMS_NO2_" , "CAMS_NO_") ,
# base + temperature, blh, tcc, sp, wd, ws
base_meteo6 = c("x" , "y" , "DOY" , "DEM" , "CAMS_NO2_" , "CAMS_NO_" , "t2m_" , "blh_" , "tcc_" , "sp_" , "wd_" , "ws_") ,
# base + all meteorological variables
base_meteo7 = c("x" , "y" , "DOY" , "DEM" , "CAMS_NO2_" , "CAMS_NO_" , "t2m_" , "blh_" , "tcc_" , "sp_" , "wd_" , "ws_" , "tp_")
)
# grid search over every combination
N.trees <- 500
gridSearch_hour_var <- list()
gridSearch_pb <- txtProgressBar(min = 0, max = length(search_hour) * length(search_var) , style = 3)
for(h in 1:length(search_hour)){
for(v in 1:length(search_var)){
# progress bar
setTxtProgressBar(gridSearch_pb , length(search_var) * (h-1) + v)
result_list_hv <- list()
# subset dataset for the predictor variable combination
modelInput_df.cv <- imputation_df %>%
# exclude rows with missing predictor/response value
filter(!if_any(everything() , is.na)) %>%
# only the hour in search_hour
select(x , y , OMI_NO2 , DOY , DEM , ends_with(search_hour[h]) , spatial_CV) %>%
# only the variables in search_var
select(OMI_NO2 ,contains(search_var[[v]]) , spatial_CV)
# model with full training data
rf_temp <- ranger(OMI_NO2 ~ . ,
data = modelInput_df.cv %>% select(-spatial_CV) ,
importance = "impurity" ,
num.trees = N.trees ,
keep.inbag = TRUE)
# store OOB-R2 of the model
result_list_hv$OOB_R2 <- rf_temp$r.squared
# store variable importance of the model
result_list_hv$varImportance <- sort(rf_temp$variable.importance , decreasing = TRUE)
# cross-validation model
result_list_hv$CV_R2_k <- c()
for(k in 1:k_fold){
rf_temp_cv <- ranger(OMI_NO2 ~ . ,
data = modelInput_df.cv %>%
filter(spatial_CV != as.character(k)) %>%
select(-spatial_CV) ,
importance = "impurity" ,
keep.inbag = TRUE)
rf_temp_cv_pred <- predict(rf_temp_cv ,
data = modelInput_df.cv %>%
filter(spatial_CV == as.character(k)) %>%
select(-spatial_CV))
rf_temp_cv_obs_pred <- modelInput_df.cv %>%
filter(spatial_CV == as.character(k)) %>%
select(OMI_NO2) %>%
rename(obs = OMI_NO2) %>%
mutate(pred = rf_temp_cv_pred$predictions)
result_list_hv$CV_R2_k[k] <- cor(rf_temp_cv_obs_pred$obs , rf_temp_cv_obs_pred$pred)^2
}
result_list_hv$CV_R2 <- mean(result_list_hv$CV_R2_k)
# store results
if(v == 1) gridSearch_hour_var[[h]] <- list()
gridSearch_hour_var[[h]][[v]] <- result_list_hv
}
# set names in the list
names(gridSearch_hour_var[[h]]) <- names(search_var)
}
rm(h,v,k)
names(gridSearch_hour_var) <- paste0("HOUR_" , search_hour)
gridSearch_OOB_R2 <- gridSearch_hour_var %>%
unlist() %>%
as.data.frame() %>%
rename(.data = . , value = `.`) %>%
mutate(var = rownames(.)) %>%
filter(grepl("OOB_R2$" , var)) %>%
rename(OOB_R2 = value) %>%
separate(col = var , into = c("Hour" , "model" , "var") , sep = "\\.") %>%
select(-var) %>%
mutate(Hour = gsub("HOUR_" , "" , Hour)) %>%
as_tibble()
gridSearch_CV_R2 <- gridSearch_hour_var %>%
unlist() %>%
as.data.frame() %>%
rename(.data = . , value = `.`) %>%
mutate(var = rownames(.)) %>%
filter(grepl("CV_R2$" , var)) %>%
rename(CV_R2 = value) %>%
separate(col = var , into = c("Hour" , "model" , "var") , sep = "\\.") %>%
select(-var) %>%
mutate(Hour = gsub("HOUR_" , "" , Hour)) %>%
as_tibble()
cowplot::plot_grid(
# visualization: OOB-R2
gridSearch_OOB_R2 %>%
ggplot(aes(x = Hour , y = model , fill = OOB_R2)) +
geom_raster() +
geom_text(aes(label = round(OOB_R2 , 3)) , color = "white" , size = 3) +
coord_cartesian(expand = FALSE) +
scale_fill_gradientn(colours = RColorBrewer::brewer.pal(5 , "YlGnBu")) +
labs(x = "Hour (CAMS and ERA5)" , y = "Predictor variable set" ,
fill = expression("OOB-R"^2)) ,
# visualization: CV-R2
gridSearch_CV_R2 %>%
ggplot(aes(x = Hour , y = model , fill = CV_R2)) +
geom_raster() +
geom_text(aes(label = round(CV_R2 , 3)) , color = "white", size = 3) +
coord_cartesian(expand = FALSE) +
scale_fill_gradientn(colours = RColorBrewer::brewer.pal(7 , "BuGn")) +
labs(x = "Hour (CAMS and ERA5)" , y = "Predictor variable set" ,
fill = expression("CV-R"^2)) ,
ncol = 2
)
From the result of the grid search over the hours, we can first observe that the models using CAMS- and ERA5-variables at 12H and 15H have the higher \(R^2\). Therefore in the next step we further look at different combination of predictor variables at 12H. However note that although the models including meteorological variables return higher OOB-\(R^2\) compared to base, it’s not the case when we look at CV-\(R^2\). The models with meteorological variables might be overfitting. Consequently in the next step we also compare base model with the other models with more predictor variables.
Here we first include all the predictor variables (12H for CAMS and ERA5) in a model (full model), and record the order of variable importance reported by the model. Then a grid search is applied where the i least important variables are removed at a time. Additionally, like mentioned above, the base model which does not consider any meteorological variables is also compared.
modelInput_df <- imputation_df %>%
select(x , y , DOY , OMI_NO2 , DEM , ends_with("_12") , spatial_CV) %>%
# exclude rows with missing predictor/response value
filter(!if_any(everything() , is.na))
# grid search: exclude i least important variables at a time
# initial: all predictor variables
rf_initial <- ranger(OMI_NO2 ~ x + y + DOY + DEM + CAMS_NO2_12 + CAMS_NO_12 +
blh_12 + t2m_12 + sp_12 + tcc_12 + tp_12 + wd_12 + ws_12 ,
data = modelInput_df ,
importance = "impurity" ,
keep.inbag = TRUE)
inputVar_ordered <- rf_initial$variable.importance %>%
sort(decreasing = TRUE) %>%
names()
rm(rf_initial)
# grid search
gridSearch_12H <- list()
gridSearch_i <- 1:7
N.trees <- 500
for(i in gridSearch_i){
gridSearch_12H[[i]] <- list()
# exclude i-1 least importance variables (starting from initial aka excluding no variable)
formula_rf.temp <- sprintf("OMI_NO2 ~ %s" ,
paste(inputVar_ordered[1:(length(inputVar_ordered)-i+1)] , collapse = '+')) %>%
formula() # making the formula object
# also try: no ERA-5 at all
if(i == max(gridSearch_i)) formula_rf.temp = OMI_NO2 ~ x + y + DOY + DEM + CAMS_NO2_12 + CAMS_NO_12
gridSearch_12H[[i]]$formula <- formula_rf.temp
# train model
rf_temp <- ranger(formula_rf.temp ,
data = modelInput_df ,
importance = "impurity" ,
num.trees = N.trees ,
keep.inbag = TRUE)
# store OOB-R2 result
gridSearch_12H[[i]]$OOB_R2 <- rf_temp$r.squared
# store importance
gridSearch_12H[[i]]$varImportance <- rf_temp$variable.importance
# cross validation
gridSearch_12H[[i]]$CV_R2_k <- c()
gridSearch_12H[[i]]$CV_slope_k <- c()
gridSearch_12H[[i]]$CV_RMSE_k <- c()
for(k in 1:k_fold){
# model
rf_temp_cv <- ranger(formula_rf.temp ,
data = modelInput_df %>%
filter(spatial_CV != as.character(k)) ,
importance = "impurity" ,
keep.inbag = TRUE)
# prediction on test set
rf_temp_cv_pred <- predict(rf_temp_cv ,
data = modelInput_df %>%
filter(spatial_CV == as.character(k)) )
rf_temp_cv_obs_pred <- modelInput_df %>%
filter(spatial_CV == as.character(k)) %>%
select(OMI_NO2) %>%
rename(obs = OMI_NO2) %>%
mutate(pred = rf_temp_cv_pred$predictions)
# calculate CV-R2
gridSearch_12H[[i]]$CV_R2_k[k] <- cor(rf_temp_cv_obs_pred$obs , rf_temp_cv_obs_pred$pred)^2
# calculate CV-slope
gridSearch_12H[[i]]$CV_slope_k[k] <- lm(obs ~ pred , data = rf_temp_cv_obs_pred)$coefficients[2]
# calculate CV-RMSE
gridSearch_12H[[i]]$CV_RMSE_k[k] <- rf_temp_cv_obs_pred %>%
summarize(RMSE = sqrt(sum(((pred-obs)^2)/n())) ) %>%
unlist
# clean
rm(rf_temp_cv , rf_temp_cv_pred , rf_temp_cv_obs_pred)
}
# result: mean CV-R2, mean slope, mean RMSE
gridSearch_12H[[i]]$CV_R2 <- mean(gridSearch_12H[[i]]$CV_R2_k)
gridSearch_12H[[i]]$CV_slope <- mean(gridSearch_12H[[i]]$CV_slope_k)
gridSearch_12H[[i]]$CV_RMSE <- mean(gridSearch_12H[[i]]$CV_RMSE_k)
# clean
rm(formula_rf.temp , rf_temp)
}
rm(i,k)
names(gridSearch_12H) <- paste0("subset" , gridSearch_i - 1)
cowplot::plot_grid(
# visaulization of grid search result: OOB- and CV-R2
data.frame(
subset = gridSearch_i - 1 ,
OOB_R2 = sapply(gridSearch_12H , function(mylist)mylist$OOB_R2 ) ,
CV_R2 = sapply(gridSearch_12H , function(mylist)mylist$CV_R2 )
) %>%
pivot_longer(cols = c(OOB_R2 , CV_R2) , names_to = "var") %>%
ggplot(aes(x = subset , y = value , color = var) ) +
geom_point() +
geom_line() +
scale_color_discrete(labels = c(CV_R2 = "Spatially-blocked cross validation" , OOB_R2 = "Out-of-bag")) +
labs(x = "# variables excluded" , y = expression("R"^2) ,
color = "Test set") +
theme_bw() +
theme(legend.position = "right") ,
# visaulization of grid search result: CV-RMSE
data.frame(
subset = gridSearch_i - 1 ,
CV_RMSE = sapply(gridSearch_12H , function(mylist)mylist$CV_RMSE )
) %>%
ggplot(aes(x = subset , y = CV_RMSE)) +
geom_point() +
geom_line() +
labs(x = "# variables excluded" , y = "CV-RMSE") +
theme_bw() ,
align = "v" , axis = "lr" , ncol = 1 , rel_heights = c(6,4)
)
The models on the x axis are respectively:
tpx - tpws - x - tptcc - ws - x - tpwd - tcc - ws - x - tpIn this case, we can see that the model with the highest CV-\(R^2\) and CV-RMSE is the base model, although the OOB-\(R^2\)s are all similar. It indicates that including the meteorological variables may not improve the model and introduce potential overfitting to the model. From this observation, we decide to drop the meteorological variables for the imputation model for OMI and only include CAMS-NO2, CAMS-NO, DOY, altitude, x, y.
Here we go back and look at 2.1.1 Result again, the model using 15H CAMS data (15H, base model) actually has higher OOB- and CV-\(R^2\). Therefore we adopt this for predictor variables of the final model.
The final model is \(OMI_{ij} \sim CAMS_{15}(NO_2)_{ij} + CAMS_{15}(NO)_{ij} + altitude_{i} + DOY{j} + x_i + y_i\), which uses the predictor variables:
formula_rf_final <- OMI_NO2 ~ CAMS_NO2_15 + CAMS_NO_15 + DEM + DOY + y + x
And the hyperparameters of the random forest model are chosen to be: num.trees=1000 and mtry=5 after grid search over possible candidate value combinations. (The tedious process was not demonstrated here.)
modelInput_df <- imputation_df %>%
select(x , y , DOY , OMI_NO2 , DEM , ends_with("_15") , spatial_CV) %>%
# exclude rows with missing predictor/response value
filter(!if_any(everything() , is.na))
rf_final <- ranger(formula_rf_final ,
data = modelInput_df ,
num.trees = 1000 ,
mtry = 5 ,
importance = "impurity" ,
keep.inbag = TRUE)
rf_final
## Ranger result
##
## Call:
## ranger(formula_rf_final, data = modelInput_df, num.trees = 1000, mtry = 5, importance = "impurity", keep.inbag = TRUE)
##
## Type: Regression
## Number of trees: 1000
## Sample size: 29354
## Number of independent variables: 6
## Mtry: 5
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 1.339053e+30
## R squared (OOB): 0.7601374
rf_final$r.squared
## [1] 0.7601374
rf_final_pred <- rf_final$predictions # OOB-predictions
rf_final_obs_pred <- modelInput_df %>%
select(x,y,DOY,OMI_NO2) %>%
rename(obs = OMI_NO2) %>%
mutate(pred = rf_final_pred)
summary(lm(obs ~ pred , data = rf_final_obs_pred))
##
## Call:
## lm(formula = obs ~ pred, data = rf_final_obs_pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.474e+16 -4.070e+14 2.831e+12 3.336e+14 1.988e+16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.037e+14 9.527e+12 -10.88 <2e-16 ***
## pred 1.042e+00 3.404e-03 306.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.154e+15 on 29352 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7614
## F-statistic: 9.368e+04 on 1 and 29352 DF, p-value: < 2.2e-16
rf_final_obs_pred %>%
ggplot(aes(x = pred , y = obs)) +
geom_abline(slope = 1 , intercept = 0 , linetype = 2 , color = "azure4") +
geom_point(shape = 1 , alpha = 0.5) +
geom_smooth(method = "lm") +
coord_fixed(1) +
labs(x = expression("Modeled OMI-NO"[2]) , y = expression("Observed OMI-NO"[2])) +
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
rf_final_CV_R2_k <- c()
for(k in 1:k_fold){
# model
rf_temp_cv <- ranger(formula_rf_final ,
data = modelInput_df %>%
filter(spatial_CV != as.character(k)) ,
importance = "impurity" ,
keep.inbag = TRUE ,
num.trees = rf_final$num.trees ,
mtry = rf_final$mtry)
# prediction on test set
rf_temp_cv_pred <- predict(rf_temp_cv ,
data = modelInput_df %>%
filter(spatial_CV == as.character(k)) )
rf_temp_cv_obs_pred <- modelInput_df %>%
filter(spatial_CV == as.character(k)) %>%
select(OMI_NO2) %>%
rename(obs = OMI_NO2) %>%
mutate(pred = rf_temp_cv_pred$predictions)
# calculate CV-R2
rf_final_CV_R2_k[k] <- cor(rf_temp_cv_obs_pred$obs , rf_temp_cv_obs_pred$pred)^2
# clean
rm(rf_temp_cv , rf_temp_cv_pred , rf_temp_cv_obs_pred)
}
# mean CV-R2
mean(rf_final_CV_R2_k)
## [1] 0.5949998
range(rf_final_CV_R2_k)
## [1] 0.4836815 0.7082183
10-fold-CV-\(R^2\): 0.595 (0.484—0.708)
Use the full datasets (all pixel-days) and predict OMI_NO2 with the final model:
projected_rf <- predict(rf_final ,
data = imputation_df %>%
filter(!is.na(DEM)) ,
type = "se")
Prepare the projected data as data frame (along with the predictor variables):
projected <- imputation_df %>%
filter(!is.na(DEM)) %>%
select(x,y,DOY,OMI_NO2) %>%
# add the projected (model-predicted) OMI-NO2
mutate(OMI_NO2_pred = projected_rf$predictions ,
OMI_NO2_pred_se = projected_rf$se) %>%
# set negative values to 0
mutate(OMI_NO2_pred = ifelse(OMI_NO2_pred < 0 , 0 , OMI_NO2_pred))
Convert the data frame to spatialtemporal array:
projected_stars <- projected %>%
# convert DOY to date
mutate(DOY = as_date(DOY , origin = "2018-12-31")) %>%
# convert to stars
st_as_stars(dims = c("x" , "y" , "DOY")) %>%
# rename dimension DOY to date
st_set_dimensions(3 , names = "date") %>%
# set coordinate system
st_set_crs(st_crs(OMI_raw)) %>%
# de-select the attribute "DOY"
select(-DOY) %>%
# imputed: observed + predicted
mutate(OMI_NO2_imputed = ifelse(is.na(OMI_NO2) , OMI_NO2_pred , OMI_NO2))
Here we can look at the final projection result:
projected_stars
## stars object with 3 dimensions and 4 attributes
## attribute(s):
## OMI_NO2 OMI_NO2_pred OMI_NO2_pred_se
## Min. :-9.982e+14 Min. :0.000e+00 Min. :1.235e+12
## 1st Qu.: 4.428e+14 1st Qu.:7.877e+14 1st Qu.:2.010e+14
## Median : 1.363e+15 Median :1.759e+15 Median :3.978e+14
## Mean : 1.959e+15 Mean :2.187e+15 Mean :5.428e+14
## 3rd Qu.: 2.714e+15 3rd Qu.:3.011e+15 3rd Qu.:7.151e+14
## Max. : 5.517e+16 Max. :4.384e+16 Max. :1.200e+16
## NA's :67006 NA's :25915 NA's :49026
## OMI_NO2_imputed
## Min. :-9.982e+14
## 1st Qu.: 7.309e+14
## Median : 1.734e+15
## Mean : 2.181e+15
## 3rd Qu.: 3.033e+15
## Max. : 5.517e+16
## NA's :25915
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 22 5.75 0.25 GEOGCS["unnamed ellipse",... NA NULL [x]
## y 1 12 48.25 -0.25 GEOGCS["unnamed ellipse",... NA NULL [y]
## date 1 365 2019-01-01 1 days Date NA NULL
projected_stars is a 3D array (x,y,date) with 4 attributes. OMI_NO2 is the raw OMI pixels, OMI_NO2_pred is the model-predicted values, OMI_NO2_pred_se is the prediction standard error reported by ranger (the random forest model), and OMI_NO2_imputed is the final imputed product. Pixels that exist in the original OMI observation stay the same in OMI_NO2_imputed, whereas missing pixels are filled in with the model-predicted values (OMI_NO2_pred).
We can visualize the projection result (of some random dates):
selected_dates <- interval("2019-02-01" , "2019-02-10")
ggplot() +
geom_stars(
data = projected_stars %>%
filter(date %within% selected_dates) %>%
merge() %>%
st_set_dimensions(4 , values = c("observed" , "predicted" , "se" , "imputed"))
) +
geom_sf(data = CH , fill = NA , color = "white") +
scale_fill_viridis_c(limits = c(0,1e16), oob = scales::squish) +
coord_sf(expand = FALSE) +
facet_grid(attributes~date) +
labs(x = "Longtitude" , y = "Latitude" , fill = "TropColumnNO2") +
theme(axis.text = element_text(size = 6))
As an alternative for the visualization, an animation is used:
library(gganimate)
annimation_obs_pred <- ggplot() +
geom_stars(
data = projected_stars %>%
select(OMI_NO2 , OMI_NO2_pred , OMI_NO2_imputed) %>%
merge() %>%
st_set_dimensions(4 , values = c("observed" , "predicted" , "imputed"))
) +
geom_sf(data = CH , fill = NA , color = "white") +
scale_fill_viridis_c(limits = c(0,1e16), oob = scales::squish) +
coord_sf(expand = FALSE) +
facet_grid(~attributes) +
transition_time(date) +
labs(x = "Longtitude" , y = "Latitude" , fill = "TropColumnNO2" ,
title = "Date: {frame_time}") +
theme(axis.text = element_text(size = 6))
animate(annimation_obs_pred , duration = 365 , fps = 1)